Amendment to the Claims



This listing of claims will replace all prior versions, and listings, of claims in the 
application: 

Listing of Claims: 



1 1. (currently amended): A system for grouping clusters of

2 semantically scored documents electronically stored in a data corpus, comprising: 

3 a scoring module determining a score, which is assigned to at least one 

4 concept that has been extracted from a plurality of electronically-stored 

5 documents, wherein the score is based on at least one of a frequency of 

6 occurrence of the at least one concept within at least one such document, a 

7 concept weight, a structural weight, and a corpus [[weight;]] weight, forming the

8 score assigned to the at least one concept as a normalized score vector for each 

9 such document, and determining a similarity between the normalized score vector 

10 for each such document as an inner product of each normalized score vector; 

11 a clustering module forming clusters of the documents [[by evaluating the

12 score for the at least one concept of each document for a best fit to the clusters

13 and assigning each document to the cluster with the best fit; and]], comprising:

14 a selection submodule evaluating a set of candidate seed 

15 documents selected from the plurality of documents; 

16 a seed document identification submodule identifying a set of seed 

17 documents by applying the similarity as a best fit to each such candidate seed 

18 document; 

19 a non-seed document identification submodule identifying a 

20 plurality of non-seed documents; 

21 a comparison submodule determining the similarity between each 

22 non-seed document and a center of each cluster; and 

23 a clustering submodule grouping each such non-seed document 

24 into a cluster with the best fit, subject to a minimum fit;

25 a threshold module determining [[similarities]] the similarity between each of

26 the documents grouped into each cluster based on the center of the cluster and the

27 scores assigned to each of the at least one concepts in [[each such]] that document,

28 dynamically determining a threshold for each cluster as a function of the

29 [[similarities]] similarity between each of the documents, and identifying and

30 reassigning [[those]] each of the documents having the [[similarities]] similarity falling

31 outside the threshold.
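
The normalized score vector and inner-product similarity recited in Claim 1 can be illustrated with a minimal Python sketch. The sparse dictionary representation, the function names `normalize` and `similarity`, and the sample concepts are assumptions made for illustration only; the claim does not prescribe a data structure.

```python
import math

def normalize(scores):
    """Scale a concept -> score mapping to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)  # an all-zero vector has no direction to preserve
    return {concept: v / norm for concept, v in scores.items()}

def similarity(a, b):
    """Inner product of two normalized score vectors stored as sparse dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(v * b.get(concept, 0.0) for concept, v in a.items())

# Hypothetical documents: concept scores before normalization.
doc_a = normalize({"contract": 3.0, "breach": 4.0})
doc_b = normalize({"contract": 6.0, "breach": 8.0})  # same direction as doc_a
doc_c = normalize({"patent": 1.0})                    # no shared concepts

print(similarity(doc_a, doc_b))  # close to 1.0: same concept profile
print(similarity(doc_a, doc_c))  # 0.0: nothing in common
```

Because both vectors have unit length, the inner product equals the cosine of the angle between them, so documents with proportional concept scores compare as nearly identical regardless of document length.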



1 2. (original): A system according to Claim 1, further comprising:

2 the scoring module calculating the score as a function of a summation of 

3 at least one of the frequency of occurrence, the concept weight, the structural 

4 weight, and the corpus weight of the at least one concept. 

1 3. (original): A system according to Claim 2, further comprising: 

2 a compression module compressing the score through logarithmic 

3 compression. 

1 4. (original): A system according to Claim 1, further comprising:

2 the scoring module calculating the concept weight as a function of a 

3 number of terms comprising the at least one concept. 

1 5. (original): A system according to Claim 1, further comprising: 

2 the scoring module calculating the structural weight as a function of a 

3 location of the at least one concept within the at least one such document. 

1 6. (original): A system according to Claim 1, further comprising: 

2 the scoring module calculating the corpus weight as a function of a 

3 reference count of the at least one concept over the plurality of documents. 

1 Claims 7-8 (canceled). 

1 9. (currently amended): A method for grouping clusters of 

2 semantically scored documents electronically stored in a data corpus, comprising:

3 determining a score, which is assigned to at least one concept that has

4 been extracted from a plurality of electronically-stored documents, wherein the 

5 score is based on at least one of a frequency of occurrence of the at least one 

6 concept within at least one such document, a concept weight, a structural weight, 

7 and a corpus weight; 

8 forming the score assigned to the at least one concept as a normalized 

9 score vector for each such document; 

10 determining a similarity between the normalized score vector for each 

11 such document as an inner product of each normalized score vector; 

12 forming logically-grouped clusters of the documents [[by evaluating the

13 score for the at least one concept of each document for a best fit to the clusters

14 and assigning each document to the cluster with the best fit;]], comprising:

15 evaluating a set of candidate seed documents selected from the 

16 plurality of documents; 

17 identifying a set of seed documents by applying the similarity as a 

18 best fit to each such candidate seed document; 

19 identifying a plurality of non-seed documents; 

20 determining the similarity between each non-seed document and a 

21 center of each cluster; and 

22 grouping each such non-seed document into a cluster with the best 

23 fit, subject to a minimum fit; 

24 determining [[similarities]] the similarity between each of the documents

25 grouped into each cluster based on the center of the cluster and the scores

26 assigned to each of the at least one concepts in [[each such]] that document;

27 dynamically determining a threshold for each cluster as a function of the

28 [[similarities]] similarity between each of the documents; and

29 identifying and reassigning [[those]] each of the documents having the

30 [[similarities]] similarity falling outside the threshold.
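
The seed selection and best-fit grouping steps of Claim 9 can be sketched as follows. This is only one possible reading: the rule that a candidate becomes a seed when it fits no existing cluster well, the parameter names `seed_max_sim` and `min_fit`, their default values, and the use of the seed vector as the cluster center are all assumptions that the claim does not specify.

```python
import math

def normalize(vec):
    """Scale a concept -> score mapping to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)

def similarity(a, b):
    """Inner product of two normalized sparse vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def cluster(docs, candidate_ids, seed_max_sim=0.5, min_fit=0.3):
    """Group documents around seeds; returns (clusters, unassigned ids)."""
    clusters = []  # each cluster: {"center": vector, "members": [doc ids]}
    # Seed selection (assumed rule): a candidate starts a new cluster only
    # if it is not already a good fit for an existing cluster center.
    for doc_id in candidate_ids:
        best = max((similarity(docs[doc_id], c["center"]) for c in clusters),
                   default=0.0)
        if best < seed_max_sim:
            clusters.append({"center": docs[doc_id], "members": [doc_id]})
    seeds = {c["members"][0] for c in clusters}
    unassigned = []
    # Non-seed documents go to the best-fitting cluster, subject to min_fit.
    for doc_id, vec in docs.items():
        if doc_id in seeds:
            continue
        best_fit, best_cluster = 0.0, None
        for c in clusters:
            fit = similarity(vec, c["center"])
            if fit > best_fit:
                best_fit, best_cluster = fit, c
        if best_cluster is not None and best_fit >= min_fit:
            best_cluster["members"].append(doc_id)
        else:
            unassigned.append(doc_id)
    return clusters, unassigned

# Hypothetical corpus of four documents.
docs = {
    "d1": normalize({"contract": 1.0}),
    "d2": normalize({"patent": 1.0}),
    "d3": normalize({"contract": 1.0, "breach": 1.0}),
    "d4": normalize({"weather": 1.0}),
}
clusters, unassigned = cluster(docs, ["d1", "d2", "d3"])
print([c["members"] for c in clusters])  # [['d1', 'd3'], ['d2']]
print(unassigned)                        # ['d4']
```

In this example d3 is a candidate but is too similar to d1 to become a seed, so it is grouped with d1 as a non-seed document, while d4 fails the minimum fit for every cluster and stays unassigned.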

1 10. (original): A method according to Claim 9, further comprising:

2 calculating the score as a function of a summation of at least one of the

3 frequency of occurrence, the concept weight, the structural weight, and the corpus 

4 weight of the at least one concept. 

1 11. (original): A method according to Claim 10, further comprising:

2 compressing the score through logarithmic compression. 

1 12. (original): A method according to Claim 9, further comprising: 

2 calculating the concept weight as a function of a number of terms 

3 comprising the at least one concept. 

1 13. (original): A method according to Claim 9, further comprising: 

2 calculating the structural weight as a function of a location of the at least 

3 one concept within the at least one such document. 

1 14. (original): A method according to Claim 9, further comprising: 

2 calculating the corpus weight as a function of a reference count of the at 

3 least one concept over the plurality of documents. 

1 Claims 15-16 (canceled). 

1 17. (currently amended): A computer-readable storage medium

2 holding code for grouping clusters of semantically scored documents 

3 electronically stored in a data corpus, comprising: 

4 code for determining a score, which is assigned to at least one concept that 

5 has been extracted from a plurality of electronically-stored documents, wherein 

6 the score is based on at least one of a frequency of occurrence of the at least one 

7 concept within at least one such document, a concept weight, a structural weight, 

8 and a corpus weight; 

9 code for forming the score assigned to the at least one concept as a 

10 normalized score vector for each such document; 

11 code for determining a similarity between the normalized score vector for 

12 each such document as an inner product of each normalized score vector;

13 code for forming logically-grouped clusters of the documents [[by

14 evaluating the score for the at least one concept of each document for a best fit to

15 the clusters and assigning each document to the cluster with the best fit]],

16 comprising:

17 code for evaluating a set of candidate seed documents selected 

18 from the plurality of documents; 

19 code for identifying a set of seed documents by applying the 

20 similarity as a best fit to each such candidate seed document; 

21 code for identifying a plurality of non-seed documents; 

22 code for determining the similarity between each non-seed 

23 document and a center of each cluster; and 

24 code for grouping each such non-seed document into a cluster with 

25 the best fit, subject to a minimum fit; 

26 code for determining [[similarities]] the similarity between each of the

27 documents grouped into each cluster based on the center of the cluster and the

28 scores assigned to each of the at least one concepts in [[each such]] that document;

29 code for dynamically determining a threshold for each cluster as a

30 function of the [[similarities]] similarity between each of the documents; and

31 code for identifying and reassigning [[those documents]] each of the

32 documents having the [[similarities]] similarity falling outside the threshold.
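
The dynamic threshold and outlier identification recited at the end of Claim 17 might look like the sketch below. The claim only says the threshold is "a function of" the similarities, so the specific rule used here (mean minus `k` population standard deviations) and the names `dynamic_threshold` and `split_outliers` are assumptions, not the applicant's stated method. Reassignment of the outliers to another cluster is left out.

```python
import statistics

def dynamic_threshold(sims, k=1.0):
    """Per-cluster threshold: mean minus k population standard deviations.

    sims maps each document id in one cluster to its similarity with the
    cluster center and must be non-empty. The rule itself is an assumption.
    """
    values = list(sims.values())
    return statistics.mean(values) - k * statistics.pstdev(values)

def split_outliers(sims, k=1.0):
    """Return (threshold, kept ids, outlier ids) for one cluster."""
    threshold = dynamic_threshold(sims, k)
    kept = [doc for doc, s in sims.items() if s >= threshold]
    outliers = [doc for doc, s in sims.items() if s < threshold]
    return threshold, kept, outliers

# Hypothetical similarities of four documents to their cluster center.
sims = {"d1": 0.9, "d2": 0.8, "d3": 0.85, "d4": 0.2}
threshold, kept, outliers = split_outliers(sims)
print(round(threshold, 3))  # about 0.404
print(kept)                 # ['d1', 'd2', 'd3']
print(outliers)             # ['d4']
```

Because the threshold is recomputed from each cluster's own similarities, a tight cluster tolerates less deviation than a loose one. The outliers returned here would then be offered to other clusters under the same best-fit test.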

1 18. (currently amended): A system for providing efficient document 

2 scoring of concepts within and clustering of documents in an electronically-stored 

3 document set, comprising: 

4 a scoring module scoring a document in an electronically-stored document 

5 set, comprising: 

6 a frequency module determining a frequency of occurrence of at 

7 least one concept within a document; 

8 a concept weight module analyzing a concept weight reflecting a 

9 specificity of meaning for the at least one concept within the document; 



OA 2 Resp 



Response to Office Action 
Docket No. 013.0207.US.UTL 



10 a structural weight module analyzing a structural weight reflecting 

11 a degree of significance based on structural location within the document for the

12 at least one concept;

13 a corpus weight module analyzing a corpus weight inversely

14 weighing a reference count of occurrences for the at least one concept within the

15 document; [[and]]

16 a scoring evaluation module evaluating a score to be associated

17 with the at least one concept as a function of the frequency, concept weight,

18 structural weight, and corpus weight; [[and]]

19 a vector module forming the score assigned to the at least one 

20 concept as a normalized score vector for each such document in the 

21 electronically-stored document set; and 

22 a determination module determining a similarity between the 

23 normalized score vector for each such document as an inner product of each 

24 normalized score vector; 

25 a clustering module grouping the documents by the score into a plurality 

26 of clusters, comprising: 

27 a selection submodule evaluating a set of candidate seed 

28 documents selected from the electronically-stored document set;

29 a cluster seed submodule identifying [[candidate]] seed documents[[,

30 which are each assigned as a seed document into a cluster with a center most

31 similar to the seed document, and]] by applying the similarity as a best fit to each

32 such candidate seed document;

33 an identification submodule identifying a plurality of non-seed

34 documents; 

35 a comparison submodule determining the similarity between each 

36 non-seed document and a cluster center of each cluster; and 

37 a clustering submodule assigning each non-seed document to the 

38 cluster with the best fit, subject to a minimum fit; and

a threshold module relocating outlier documents, comprising determining
[[similarities]] the similarity between each of the documents grouped into each
cluster based on the center of the cluster and the scores assigned to each of the at
least one concepts in [[each such]] that document, dynamically determining a
threshold for each cluster as a function of the [[similarities]] similarity between each
of the documents, and identifying and reassigning each of the documents with the
[[similarities]] similarity falling outside the threshold.

19. (previously presented): A system according to Claim 18, further
comprising:

the scoring module evaluating the score in accordance with the formula:

$$S_i = \sum_{j} f_{ij} \times cw_{ij} \times sw_{ij} \times rw_{ij}$$

where $S_i$ comprises the score, $f_{ij}$ comprises the frequency, $0 < cw_{ij} \le 1$ comprises
the concept weight, $0 < sw_{ij} \le 1$ comprises the structural weight, and $0 < rw_{ij} \le 1$
comprises the corpus weight for occurrence j of concept i.

20. (previously presented): A system according to Claim 19, further
comprising:

the concept weight module evaluating the concept weight in accordance
with the formula:



where $cw_{ij}$ comprises the concept weight and $t_{ij}$ comprises a number of terms for
occurrence j of each such concept i.

21. (previously presented): A system according to Claim 19, further
comprising:

the structural weight module evaluating the structural weight in
accordance with the formula:



where $sw_{ij}$ comprises the structural weight for occurrence j of each such concept i.

22. (previously presented): A system according to Claim 19, further 
comprising: 

the corpus weight module evaluating the corpus weight in accordance with 
the formula: 



where $rw_{ij}$ comprises the corpus weight, $r_{ij}$ comprises a reference count for
occurrence j of each such concept i, T comprises a total number of reference
counts of documents in the document set, and M comprises a maximum reference 
count of documents in the document set. 
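
As a worked illustration of the score recited in Claim 19, the sketch below sums, over the occurrences j of one concept i, the product of the frequency and the three weights. The function name `concept_score`, the tuple layout, and the sample values are hypothetical; the weights are supplied directly rather than derived from the weight formulas of Claims 20 through 22.

```python
def concept_score(occurrences):
    """Sum f * cw * sw * rw over the occurrences of a single concept.

    occurrences: iterable of (frequency, concept_weight,
    structural_weight, corpus_weight) tuples, one per occurrence j.
    """
    total = 0.0
    for frequency, cw, sw, rw in occurrences:
        total += frequency * cw * sw * rw
    return total

# Hypothetical concept with two occurrences.
occurrences = [
    (2, 0.5, 0.5, 0.25),  # contributes 2 * 0.5 * 0.5 * 0.25 = 0.125
    (4, 0.5, 0.25, 0.5),  # contributes 4 * 0.5 * 0.25 * 0.5 = 0.25
]
score = concept_score(occurrences)
print(score)  # 0.375
```

Each weight scales the raw frequency down, so an occurrence in a low-significance location or of a very common concept contributes less to the concept's score than the same frequency elsewhere.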

23. (previously presented): A system according to Claim 19, further 
comprising: 

a compression module compressing the score in accordance with the 
formula:

$$S'_i = \log(S_i + 1)$$
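
Logarithmic compression of the form log(S + 1) can be sketched as below. The choice of the natural logarithm is an assumption, since the claim does not state a base, and the function name `compress` is hypothetical.

```python
import math

def compress(score):
    """Compress a non-negative score as log(score + 1), natural log assumed."""
    if score < 0:
        raise ValueError("score must be non-negative")
    return math.log(score + 1.0)

print(compress(0.0))   # 0.0: a zero score stays zero
print(compress(9.0))   # about 2.303
print(compress(99.0))  # about 4.605
```

Adding 1 before taking the logarithm keeps a zero score at zero, and the compression flattens large scores so that a few very frequent concepts do not dominate the score vector.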



24. (original): A system according to Claim 18, further comprising: 
a global stop concept vector cache maintaining concepts and terms; and 
a filtering module filtering selection of the at least one concept based on 
the concepts and terms maintained in the global stop concept vector cache. 










1 25. (original): A system according to Claim 18, further comprising:

2 a parsing module identifying terms within at least one document in the 

3 document set, and combining the identified terms into one or more of the 

4 concepts. 

1 26. (original): A system according to Claim 25, further comprising: 

2 the parsing module structuring each such identified term in the one or 

3 more concepts into canonical concepts comprising at least one of word root, 

4 character case, and word ordering. 

1 27. (original): A system according to Claim 25, wherein at least one of 

2 nouns, proper nouns and adjectives are included as terms. 

1 Claims 28-30 (canceled). 

1 31. (currently amended): A system according to [[Claim 30]] Claim 18,

2 further comprising:

3 the similarity [[module]] submodule calculating the similarity in accordance

4 with the formula:

5 $$\cos\sigma_{AB} = \frac{\langle \vec{S}_A \cdot \vec{S}_B \rangle}{|\vec{S}_A||\vec{S}_B|}$$

6 where $\cos\sigma_{AB}$ comprises a similarity between a document A and a document B,

7 $\vec{S}_A$ comprises a score vector for document A, and $\vec{S}_B$ comprises a score vector for

8 document B.

1 Claims 32-34 (canceled). 

1 35. (currently amended): A method for providing efficient document 

2 scoring of concepts within and clustering of documents in an electronically-stored 

3 document set, comprising: 

4 scoring a document in an electronically-stored document set, comprising:

5 determining a frequency of occurrence of at least one concept

6 within a document; 

7 analyzing a concept weight reflecting a specificity of meaning for 

8 the at least one concept within the document; 

9 analyzing a structural weight reflecting a degree of significance 

10 based on structural location within the document for the at least one concept;

11 analyzing a corpus weight inversely weighing a reference count of

12 occurrences for the at least one concept within the document; and

13 evaluating a score to be associated with the at least one concept as

14 a function of the frequency, concept weight, structural weight, and corpus weight; 

15 [[and]] 

16 forming the score assigned to the at least one concept as a normalized 

17 score vector for each such document in the electronically-stored document set; 

18 determining a similarity between the normalized score vector for each 

19 such document as an inner product of each normalized score vector; 

20 grouping the documents by the score into a plurality of clusters, 

21 comprising: 

22 evaluating a set of candidate seed documents selected from the 

23 electronically-stored document set; 

24 identifying [[candidate]] seed documents[[, which are each assigned as

25 a seed document into a cluster with a center most similar to the seed document]] by

26 applying the similarity as a best fit to each such candidate seed document;

27 identifying a plurality of non-seed documents; 

28 determining the similarity between each non-seed document and a 

29 center of each cluster; and 

30 assigning each non-seed document to the cluster with the best fit,

31 subject to a minimum fit; and

32 relocating outlier documents, comprising:

determining [[similarities]] the similarity between each of the
documents grouped into each cluster based on the center of the cluster and the
scores assigned to each of the at least one concepts in [[each such]] that document;

dynamically determining a threshold for each cluster as a function
of the [[similarities]] similarity between each of the documents; and

identifying and reassigning each of the documents with the
[[similarities]] similarity falling outside the threshold.

36. (previously presented): A method according to Claim 35, further
comprising:

evaluating the score in accordance with the formula:

$$S_i = \sum_{j} f_{ij} \times cw_{ij} \times sw_{ij} \times rw_{ij}$$

where $S_i$ comprises the score, $f_{ij}$ comprises the frequency, $0 < cw_{ij} \le 1$ comprises
the concept weight, $0 < sw_{ij} \le 1$ comprises the structural weight, and $0 < rw_{ij} \le 1$
comprises the corpus weight for occurrence j of concept i.

37. (previously presented): A method according to Claim 36, further
comprising:

evaluating the concept weight in accordance with the formula:



where $cw_{ij}$ comprises the concept weight and $t_{ij}$ comprises a number of terms for
occurrence j of each such concept i.

38. (previously presented): A method according to Claim 36, further
comprising:

evaluating the structural weight in accordance with the formula:



where $sw_{ij}$ comprises the structural weight for occurrence j of each such concept i.

39. (previously presented): A method according to Claim 36, further
comprising:

evaluating the corpus weight in accordance with the formula:

$$rw_{ij} = \begin{cases} \ldots, & r_{ij} > M \\ 1.0, & \ldots \end{cases}$$

where $rw_{ij}$ comprises the corpus weight, $r_{ij}$ comprises a reference count for
occurrence j of each such concept i, T comprises a total number of reference
counts of documents in the document set, and M comprises a maximum reference
count of documents in the document set.

40. (previously presented): A method according to Claim 36, further
comprising:

compressing the score in accordance with the formula:

$$S'_i = \log(S_i + 1)$$

where $S'_i$ comprises the compressed score for each such concept i.

41. (original): A method according to Claim 35, further comprising:
maintaining concepts and terms in a global stop concept vector cache; and
filtering selection of the at least one concept based on the concepts and
terms maintained in the global stop concept vector cache.

42. (original): A method according to Claim 35, further comprising:
identifying terms within at least one document in the document set; and
combining the identified terms into one or more of the concepts.

1 43. (original): A method according to Claim 42, further comprising: 

2 structuring each such identified term in the one or more concepts into 

3 canonical concepts comprising at least one of word root, character case, and word 

4 ordering. 

1 44. (original): A method according to Claim 42, further comprising: 

2 including as terms at least one of nouns, proper nouns and adjectives. 

1 Claims 45-47 (canceled). 

1 48. (currently amended): A method according to [[Claim 17]] Claim 35,

2 further comprising:

3 calculating the similarity in accordance with the formula:

4 $$\cos\sigma_{AB} = \frac{\langle \vec{S}_A \cdot \vec{S}_B \rangle}{|\vec{S}_A||\vec{S}_B|}$$

5 where $\cos\sigma_{AB}$ comprises a similarity between a document A and a document B,

6 $\vec{S}_A$ comprises a score vector for document A, and $\vec{S}_B$ comprises a score vector for

7 document B.

1 Claims 49-51 (canceled). 

1 52. (currently amended): A computer-readable storage medium 

2 holding code for providing efficient document scoring of concepts within and 

3 clustering of documents in an electronically-stored document set, comprising: 

4 code for scoring a document in an electronically-stored document set, 

5 comprising: 

6 code for determining a frequency of occurrence of at least one 

7 concept within a document; 

8 code for analyzing a concept weight reflecting a specificity of 

9 meaning for the at least one concept within the document;

10 code for analyzing a structural weight reflecting a degree of

11 significance based on structural location within the document for the at least one

12 concept;

13 code for analyzing a corpus weight inversely weighing a reference

14 count of occurrences for the at least one concept within the document; and

15 code for evaluating a score to be associated with the at least one

16 concept as a function of the frequency, concept weight, structural weight, and 

17 corpus weight; [[and]] 

18 code for forming the score assigned to the at least one concept as a 

19 normalized score vector for each such document in the electronically-stored 

20 document set; 

21 code for determining a similarity between the normalized score vector for 

22 each such document as an inner product of each normalized score vector; 

23 code for grouping the documents by the score into a plurality of clusters, 

24 comprising: 

25 code for evaluating a set of candidate seed documents selected 

26 from the electronically-stored document set; 

27 code for identifying [[candidate]] seed documents[[, which are each

28 assigned as a seed document into a cluster with a center most similar to the seed

29 document]] by applying the similarity as a best fit to each such candidate seed

30 document;

31 code for identifying a plurality of non-seed documents; 

32 code for determining the similarity between each non-seed 

33 document and a center of each cluster; and 

34 code for assigning each non-seed document to the cluster with the 

35 best fit, subject to a minimum fit; and

36 code for relocating outlier documents, comprising:

37 code for determining [[similarities]] the similarity between each of the

38 documents grouped into each cluster based on the center of the cluster and the

39 scores assigned to each of the at least one concepts in [[each such]] that document;

40 code for dynamically determining a threshold for each cluster as a

41 function of the [[similarities]] similarity between each of the documents; and

42 code for identifying and reassigning each of the documents with

43 the [[similarities]] similarity falling outside the threshold.

1 53. (currently amended): An apparatus for providing efficient 

2 document scoring of concepts within and clustering of documents in an 

3 electronically-stored document set, comprising: 

4 means for scoring a document in an electronically-stored document set, 

5 comprising: 

6 means for determining a frequency of occurrence of at least one 

7 concept within a document; 

8 means for analyzing a concept weight reflecting a specificity of 

9 meaning for the at least one concept within the document; 

10 means for analyzing a structural weight reflecting a degree of 

11 significance based on structural location within the document for the at least one

12 concept;

13 means for analyzing a corpus weight inversely weighing a

14 reference count of occurrences for the at least one concept within the document;

15 and

16 means for evaluating a score to be associated with the at least one

17 concept as a function of the frequency, concept weight, structural weight, and

18 corpus weight; [[and]]

19 means for forming the score assigned to the at least one concept as a 

20 normalized score vector for each such document in the electronically-stored 

21 document set; 

22 means for determining a similarity between the normalized score vector 

23 for each such document as an inner product of each normalized score vector; 

24 means for grouping the documents by the score into a plurality of clusters, 

25 comprising:

26 means for evaluating a set of candidate seed documents selected

27 from the electronically-stored document set;

28 means for identifying [[candidate]] seed documents[[, which are each

29 assigned as a seed document into a cluster with a center most similar to the seed

30 document]] by applying the similarity as a best fit to each such candidate seed

31 document;

32 means for identifying a plurality of non-seed documents; 

33 means for determining the similarity between each non-seed 

34 document and a center of each cluster; and 

35 means for assigning each non-seed document to the cluster with 

36 the best fit, subject to a minimum fit; and

37 means for relocating outlier documents, comprising:

38 means for determining [[similarities]] the similarity between each of

39 the documents grouped into each cluster based on the center of the cluster and the

40 scores assigned to each of the at least one concepts in [[each such]] that document;

41 means for dynamically determining a threshold for each cluster as

42 a function of the [[similarities]] similarity between each of the documents; and

43 means for identifying and reassigning each of the documents with

44 the [[similarities]] similarity falling outside the threshold.